ci(rust-test): mold linker on the coverage job (parity) — fixes the test-with-coverage flake; clears the SoA migration #488

Merged
AdaWorldAPI merged 3 commits into main from claude/ci-coverage-mold-parity
Jun 13, 2026

Conversation

@AdaWorldAPI

@AdaWorldAPI AdaWorldAPI commented Jun 12, 2026

Owner

What this is

A two-file fix for the test-with-coverage CI flake, plus the tech-debt record. Diagnosed first-hand, not inferred — and the diagnosis clears the SoA migration of any fault.

The question this answers

"Is test-with-coverage failing because the SoA-vs-singleton migration is mid-flight spaghetti?" No.

  • A migration logic bug would fail the plain test (stable) job too. It doesn't — test is green; only the llvm-cov-instrumented variant fails, on the same test command. Instrumentation doesn't change logic.
  • The two open SoA debts are harmless to tests: TD-RESONANCEDTO-DUP-1 (P3 name-dup, user-deferred) and TD-UNBUNDLE-FROM-1 (~1 bit / 100-epoch gestalt drift). Neither crashes a test.
  • The migration plan (bindspace-singleton-to-mailbox-soa-v1, singleton-to-snapshot-nudge-v1) is clean — its codebook-vs-singleton rule is crisp and it needs no calibration on account of this.

The actual root cause — a CI job asymmetry

The test job sets up the mold linker with this comment:

Heavy lance+datafusion integration-test binaries OOM the default GNU ld at the link step (intermittent). mold links them fast + low-memory.

The test-with-coverage job did not set up mold, yet it links the larger llvm-cov-instrumented binaries with the default linker, so the OOM is more likely there. Evidence: across the last 50 rust-test.yml runs, exactly 2 hit test=success / cov=failure (this branch's base + claude/nice-edison-g4rhhl); the plain test job stayed green in both. An intermittent failure rate of 2/50 is consistent with memory-pressure OOM rather than a deterministic bug.

The fix

Add the identical rui314/setup-mold@v1 step to the coverage job (parity with test). The action is already trusted in this repo — used by test, release.yml, and rust-publish.yml. YAML validated locally.
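The parity fix can be sketched as the workflow fragment below. This is a minimal sketch, not the repository's actual file: the job names `test` and `test-with-coverage`, the action reference `rui314/setup-mold@v1`, and the step name `Setup mold linker` come from this PR, while the runner label and the `actions/checkout` version are assumptions made for illustration.

```yaml
# Sketch only: runner label and checkout version are assumed, not taken from the repo.
jobs:
  test:
    runs-on: ubuntu-latest          # assumption
    steps:
      - uses: actions/checkout@v4   # version is an assumption
      # Already present before this PR: mold avoids the GNU ld OOM at the link step.
      - name: Setup mold linker
        uses: rui314/setup-mold@v1

  test-with-coverage:
    runs-on: ubuntu-latest          # assumption
    steps:
      - uses: actions/checkout@v4   # version is an assumption
      # Added by this PR: the same step, so instrumented binaries also link with mold.
      - name: Setup mold linker
        uses: rui314/setup-mold@v1
      # ... coverage build, test, and upload steps unchanged (not shown) ...
```

Repeating the identical step in both jobs keeps the two link environments the same, so a difference in outcome between them can no longer be explained by the linker.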

Honest residual (recorded in the debt entry)

  • Codecov upload already sets fail_ci_if_error: false, so this was a non-blocking job-level ❌ (mergeable=True) — cosmetic noise, not a merge gate.
  • Without the CI log (token returns 403 on actions/jobs/.../logs) a timing-race that only surfaces under instrumentation's slower execution can't be 100% excluded — but the migration's concurrency tests (D-SNGL-6 writer+reader threads) are PROPOSAL, not shipped, so there is no concurrent SoA test to race yet. If coverage still fails after mold → escalate to the race hypothesis (read the log with a scoped token). That escalation path is written into TD-CI-COVERAGE-MOLD-1.
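For reference, the non-blocking upload setting mentioned in the first bullet would look roughly like this. Only the action name `codecov/codecov-action` and the `fail_ci_if_error: false` input come from this PR; the step name and the action version are hypothetical.

```yaml
# Sketch only: step name and action version are assumptions.
- name: Upload coverage to Codecov   # hypothetical step name
  uses: codecov/codecov-action@v4    # version is an assumption
  with:
    fail_ci_if_error: false          # an upload error does not fail this step
```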

Board

TD-CI-COVERAGE-MOLD-1 recorded in TECH_DEBT.md (Open → paid-by this PR; confirm on next green coverage run). Per the Mandatory Board-Hygiene Rule, the debt observation lands in the same commit as the fix.

https://claude.ai/code/session_01PBTGaPCSnnt6u3pjXpbLwY

Summary by CodeRabbit

  • Bug Fixes

    • Fixed intermittent CI coverage test failures by aligning linker setup and reducing instrumented build size.
  • Chores

    • Optimized CI coverage job configuration to improve build stability and reliability.
  • Documentation

    • Added a technical-debt entry documenting the diagnosis, repro, and mitigation steps for the CI coverage flakiness.

@coderabbitai

coderabbitai Bot commented Jun 12, 2026


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 55595d42-9c7d-40c5-94c7-d9432ddbcd33

📥 Commits

Reviewing files that changed from the base of the PR and between 7f98d23 and defd290.

📒 Files selected for processing (1)
  • .github/workflows/rust-test.yml
📝 Walkthrough

The CI test-with-coverage job is modified to install a pinned mold linker and set RUSTFLAGS=-C debuginfo=0; a tech-debt entry documents the link-step OOM diagnosis and the applied fixes.

Changes

CI Coverage Test Job Resource Fix

| Layer / File(s) | Summary |
|---|---|
| Coverage job configuration updates (.github/workflows/rust-test.yml) | Pins rui314/setup-mold for the main test job to a commit SHA, sets RUSTFLAGS=-C debuginfo=0 at the test-with-coverage job level, and adds a Setup mold linker step in test-with-coverage using the pinned action. |
| Coverage failure issue and resolution documentation (.claude/board/TECH_DEBT.md) | Adds TD-CI-COVERAGE-MOLD-1 recording the missing mold linker in test-with-coverage, local reproduction of disk/RSS exhaustion at link time, measured size impact, confirmation tests pass when linking succeeds, and the remediation (job-level mold setup + debuginfo=0). |

Possibly related PRs

  • AdaWorldAPI/lance-graph#451: Also modifies .github/workflows/rust-test.yml to add or rely on rui314/setup-mold to avoid linker OOM flakiness.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Poem

I dug through builds at night,
found mold would keep links light.
Debuginfo trimmed, jobs now sing—
CI hums, the rabbits spring! 🐇✨

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title directly describes the main change (adding mold linker to coverage job for parity) and its outcome (fixes flake, clears SoA migration), matching the core objectives. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


claude added 2 commits June 12, 2026 22:26
… + TD-CI-COVERAGE-MOLD-1

Diagnosis (grounded, not inferred): the test-with-coverage job intermittently
failed (2/50 recent runs) while the plain test job stayed green on the SAME
test command. Root cause is NOT the SoA-singleton migration and NOT a logic
bug -- a logic bug would fail the plain test job too. The cause is a CI
asymmetry: the `test` job sets up the mold linker (with a comment that the
heavy lance+datafusion binaries OOM the default GNU ld at link), but the
`test-with-coverage` job did not -- and it links even LARGER llvm-cov
instrumented binaries with the default linker, so the OOM is more likely there.

Fix: add the identical mold setup step to the coverage job (the action is
already trusted -- used by the test job, release.yml, rust-publish.yml).

Board: TD-CI-COVERAGE-MOLD-1 recorded (Open, paid-by this PR, confirm on next
green coverage run). The entry explicitly records that the SoA migration plan
(bindspace-singleton-to-mailbox-soa-v1) needs NO calibration on account of
this -- the coverage failure is orthogonal infra noise, fail_ci_if_error:false
already keeps it non-blocking, and the honest residual (timing-race not 100%
excluded without the 403'd log) is noted with its escalation path.

https://claude.ai/code/session_01PBTGaPCSnnt6u3pjXpbLwY
…COVERAGE-MOLD-1, second ceiling found

Local reproduction with CI's exact flags (debuginfo=1, x86-64-v3,
CARGO_INCREMENTAL=0) confirms the diagnosis and sharpens it:

- The --tests --no-run build died 3x at link with CI's exact opaque
  signature: rustc-LLVM 'IO failure on output stream', ld killed by
  SIGBUS, 'could not compile ... (exit status: 101)'. Resource
  exhaustion at link — never a compile or test error.
- Measured: 17 integration-test binaries x ~930 MB at debuginfo=1
  (~252 MB at debuginfo=0, -73%). Set + deps + instrumentation +
  profraw lands exactly on a hosted runner's disk/RSS budget — a
  cliff edge, which is what a 2/50 intermittent looks like. TWO
  ceilings: GNU-ld RSS (mold fixes) AND disk (mold does not).
- No test bug: every binary that linked was executed — 98/98
  integration tests pass on lance 7.0.0. The SoA exoneration in the
  debt entry is now empirical.
- debuginfo=0 is coverage-safe, verified: 600/600 contract tests under
  '-C instrument-coverage -C debuginfo=0'; __llvm_covmap +
  __llvm_prf_* sections present; .profraw emitted. Coverage mapping
  is not DWARF.
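The measured sizes above work out as follows, assuming all 17 binaries are close to the quoted per-binary figures. The totals are derived here, not stated in the commit, and they cover only the test binaries, not dependencies, instrumentation data, or profraw output.

```latex
\begin{align*}
\text{debuginfo=1:}\quad & 17 \times 930\ \text{MB} = 15{,}810\ \text{MB} \approx 15.8\ \text{GB} \\
\text{debuginfo=0:}\quad & 17 \times 252\ \text{MB} = 4{,}284\ \text{MB} \approx 4.3\ \text{GB} \\
\text{reduction:}\quad & 1 - \frac{252}{930} \approx 0.73
\end{align*}
```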

Fix: job-level RUSTFLAGS '-C debuginfo=0 -C target-cpu=x86-64-v3' on
test-with-coverage only (test job keeps workflow-level debuginfo=1).
Mold stays from the parent commit. Note: job-level RUSTFLAGS gives
the coverage job its own Swatinem cache key; first run repopulates.
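A minimal sketch of the job-level override described above. The job-level `RUSTFLAGS` string is quoted from this commit message; the exact workflow-level string, the runner label, and the checkout version are assumptions inferred for illustration.

```yaml
# Sketch only: the workflow-level values are inferred from the commit message.
env:
  CARGO_INCREMENTAL: 0
  RUSTFLAGS: "-C debuginfo=1 -C target-cpu=x86-64-v3"   # assumed exact string; kept by `test`

jobs:
  test-with-coverage:
    runs-on: ubuntu-latest   # assumption
    env:
      # A job-level variable overrides the workflow-level one of the same name,
      # for this job only.
      RUSTFLAGS: "-C debuginfo=0 -C target-cpu=x86-64-v3"
    steps:
      - uses: actions/checkout@v4   # version is an assumption
```

Because the job-level value replaces the workflow-level one rather than appending to it, the override has to restate `-C target-cpu=x86-64-v3` alongside the new `-C debuginfo=0`.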

https://claude.ai/code/session_01PBTGaPCSnnt6u3pjXpbLwY

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.github/workflows/rust-test.yml:
- Around line 137-144: The workflow currently uses the tag-pinned action
reference "uses: rui314/setup-mold@v1" which exposes the job to tag-retargeting;
update that action reference to a specific commit SHA (e.g., replace "@v1" with
"@<commit-sha>") so the mold setup action is SHA-pinned and immutable; ensure
you pick a known-good commit SHA from the rui314/setup-mold repo and replace the
uses line accordingly to remove tag-pinning risk.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: e335dbe9-cbac-4a56-b809-70497276dddb

📥 Commits

Reviewing files that changed from the base of the PR and between 5363f43 and 4491675.

📒 Files selected for processing (2)
  • .claude/board/TECH_DEBT.md
  • .github/workflows/rust-test.yml

Comment thread .github/workflows/rust-test.yml Outdated
@AdaWorldAPI AdaWorldAPI force-pushed the claude/ci-coverage-mold-parity branch from 4491675 to b56bb2c on June 12, 2026 22:28
…uses)

Replace 'uses: rui314/setup-mold@v1' with the resolved commit SHA
9c9c13bf4c3f1adef0cc596abc155580bcb04444 in both occurrences (test
job + test-with-coverage job).

CodeRabbit flagged line 144 only; the test job's existing @v1 reference
at line 59 carries the identical tag-retargeting risk for the same
action, so SHA-pin both for consistency. Other tag-pinned actions
in this workflow (actions/checkout, Swatinem/rust-cache,
taiki-e/install-action, codecov/codecov-action) are pre-existing
in main and out of scope for this PR.
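The pinned step would look roughly like this in both jobs. The SHA and the step name are taken from this PR; the trailing comment recording the original tag is a common convention and an assumption here, not something confirmed to be in the workflow.

```yaml
# Sketch only: the trailing tag comment is an assumed convention.
- name: Setup mold linker
  uses: rui314/setup-mold@9c9c13bf4c3f1adef0cc596abc155580bcb04444   # was @v1
```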

https://claude.ai/code/session_01PBTGaPCSnnt6u3pjXpbLwY